fix: Make XML parsing more robust with regex fallback (#455) by vivekvar-dl · Pull Request #473 · SylphAI-Inc/AdalFlow

vivekvar-dl · 2026-03-15T14:41:29Z

Summary

This PR addresses issue #455 by making XML parsing in the TGD optimizer more robust when handling malformed output from LLMs like Gemini Flash 2.5.

Problem

The current XML parser (CustomizedXMLParser in tgd_optimizer.py) fails completely when LLMs produce malformed XML, leading to loss of the proposed_variable and other important fields. Users reported this was more frequent with Gemini Flash 2.5.
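
To illustrate the failure mode, here is a minimal sketch of strict parsing with the standard library. The `<response>` wrapper, the sample strings, and the `strict_parse` helper are hypothetical; this is not the code in tgd_optimizer.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical LLM output: the <proposed_variable> tag is never closed.
malformed = "<response><proposed_variable>New prompt text</response>"
well_formed = (
    "<response><proposed_variable>New prompt text</proposed_variable></response>"
)


def strict_parse(xml_text):
    """Return the proposed_variable text, or None when strict parsing fails."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A single structural error makes the whole document unusable here,
        # so every field is lost, not just the broken one.
        return None
    node = root.find("proposed_variable")
    return node.text if node is not None else None


print(strict_parse(well_formed))
print(strict_parse(malformed))
```

With strict parsing alone, one mismatched tag is enough to lose the proposed value even though it is plainly visible in the raw text.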

Solution

This PR implements a multi-layered approach to XML parsing:

XML Sanitization - Removes invalid control characters and handles CDATA sections before parsing
Regex Fallback - When ET.fromstring fails, falls back to regex-based extraction
Better Text Extraction - Improved handling of nested XML elements and text content
Graceful Error Recovery - Returns extracted content instead of error placeholders
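
The layers above could fit together roughly as follows. This is a sketch under assumptions, not the PR's code: the function names, the `reasoning` and `method` tag names, the regex patterns, and the choice to unwrap CDATA only in the fallback path are all illustrative.

```python
import logging
import re
import xml.etree.ElementTree as ET

log = logging.getLogger(__name__)

# C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
_CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def sanitize_xml(text):
    """Layer 1: drop characters that would make ET.fromstring fail outright."""
    return _INVALID_XML_CHARS.sub("", text)


def extract_with_regex(text, tags):
    """Layer 2: best-effort extraction when the text is not well-formed XML."""
    result = {}
    for tag in tags:
        # Tolerate a missing closing tag by also stopping at end of input.
        match = re.search(
            rf"<{tag}\b[^>]*>(.*?)(?:</{tag}>|\Z)", text, re.DOTALL
        )
        if match:
            # Unwrap CDATA wrappers so callers see only the content.
            value = _CDATA.sub(lambda c: c.group(1), match.group(1))
            result[tag] = value.strip()
    return result


def parse_response(text, tags=("reasoning", "method", "proposed_variable")):
    """Try strict parsing first; fall back to regex only on ParseError."""
    cleaned = sanitize_xml(text)
    try:
        root = ET.fromstring(cleaned)
    except ET.ParseError as exc:
        log.warning("Strict XML parsing failed (%s); using regex fallback", exc)
        return extract_with_regex(cleaned, tags)
    result = {}
    for tag in tags:
        # The requested tag may be the root itself when no wrapper is present.
        node = root if root.tag == tag else root.find(f".//{tag}")
        if node is not None:
            # itertext() also collects text that sits inside nested elements.
            result[tag] = "".join(node.itertext()).strip()
    return result
```

Keeping the regex path behind the `except` clause means well-formed input never touches it, which is what makes the fallback safe to add.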

Changes

Added _sanitize_xml() method to clean invalid characters and extract CDATA content
Added _extract_with_regex() fallback parser using regex patterns
Enhanced get_element_text() helper to handle nested elements properly
Improved error handling with informative logging via log.warning() and log.info()
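
For the nested-element case, the sketch below shows why reading `.text` alone is not enough and two ways a helper might recover the full content. The helper names and sample document are hypothetical, and the description does not say which behavior the PR's get_element_text() chooses.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<response><proposed_variable>Answer in <b>one</b> sentence."
    "</proposed_variable></response>"
)


def element_text_only(root, tag):
    """Concatenate all text under the element, dropping nested markup."""
    node = root.find(tag)
    if node is None:
        return ""
    return "".join(node.itertext()).strip()


def element_inner_xml(root, tag):
    """Return the element's content with nested tags preserved."""
    node = root.find(tag)
    if node is None:
        return ""
    # ET.tostring serializes each child together with its tail text.
    children = "".join(ET.tostring(child, encoding="unicode") for child in node)
    return ((node.text or "") + children).strip()


# .text stops at the first child element, so the value is truncated.
print(repr(doc.find("proposed_variable").text))
print(element_text_only(doc, "proposed_variable"))
print(element_inner_xml(doc, "proposed_variable"))
```

Which variant is preferable depends on whether markup inside a proposed prompt is meaningful to the optimizer.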

Testing

Tested with various malformed inputs:

✅ Unclosed tags
✅ Invalid control characters
✅ CDATA sections
✅ Missing wrapper elements
✅ Nested XML in content
✅ Completely broken XML

All test cases now successfully extract the proposed_variable field.
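
A table-driven check along these lines could cover most of the categories above. The inputs and `extract_proposed_variable` are hypothetical stand-ins written for this sketch; they are not the PR's test suite or its fallback implementation.

```python
import re


def extract_proposed_variable(text):
    """Hypothetical stand-in for a regex-based fallback extractor."""
    # Strip control characters that XML 1.0 does not allow.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    match = re.search(
        r"<proposed_variable\b[^>]*>(.*?)(?:</proposed_variable>|\Z)",
        text,
        re.DOTALL,
    )
    if match is None:
        return None
    # Unwrap any CDATA section inside the captured value.
    value = re.sub(
        r"<!\[CDATA\[(.*?)\]\]>", r"\1", match.group(1), flags=re.DOTALL
    )
    return value.strip()


CASES = {
    "unclosed tag": "<proposed_variable>Be concise",
    "control character": "<proposed_variable>Be\x00 concise</proposed_variable>",
    "CDATA section": "<proposed_variable><![CDATA[Be concise]]></proposed_variable>",
    "missing wrapper": "Sure! <proposed_variable>Be concise</proposed_variable> Done.",
    "nested XML": "<proposed_variable>Be <b>concise</b></proposed_variable>",
}

results = {name: extract_proposed_variable(raw) for name, raw in CASES.items()}
for name, value in results.items():
    print(f"{name}: {value!r}")
```

Input with no `proposed_variable` tag at all returns `None` in this sketch, so a caller can still distinguish "nothing found" from an empty value.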

Backward Compatibility

✅ Fully backward compatible - well-formed XML still parses via the standard XML parser. The regex fallback only activates when XML parsing fails.

Fixes #455

- Add XML sanitization to remove invalid control characters
- Handle CDATA sections properly by extracting content
- Implement regex-based fallback when ET.fromstring fails
- Improve text extraction to handle nested XML elements
- Add comprehensive error recovery for malformed LLM output

This addresses issue SylphAI-Inc#455 where XML parsing was failing on malformed output from Gemini Flash 2.5. The parser now gracefully falls back to regex extraction when strict XML parsing fails, ensuring that the proposed_variable and other fields are still extracted correctly.

The changes include:

1. _sanitize_xml() method to clean invalid characters and CDATA
2. _extract_with_regex() fallback parser using regex patterns
3. Enhanced get_element_text() to handle nested elements
4. Better error handling with informative logging

Tested with various malformed inputs including unclosed tags, invalid characters, CDATA sections, and completely broken XML.

vivekvar-dl wants to merge 1 commit into SylphAI-Inc:main from vivekvar-dl:fix-xml-parsing-issue-455